An Accelerator for Sparse Convolutional Neural Networks Leveraging Systolic General Matrix-matrix Multiplication
Authors
Abstract
This article proposes a novel hardware accelerator for the inference task with sparse convolutional neural networks (CNNs) by building a unit to perform the Image to Column (Im2Col) transformation of the input feature map, coupled with a systolic-array-based general matrix-matrix multiplication (GEMM) unit. Our design carefully overlaps the Im2Col transformation with the GEMM computation to maximize parallelism. We propose an Im2Col design that uses a set of distributed local memories connected by a ring network, which improves energy efficiency and latency by streaming the input feature map only once. The systolic array in the GEMM unit can be dynamically configured as multiple units with square-shaped systolic arrays or as a single tall array. This dynamic reconfigurability enables effective pipelining of the Im2Col and GEMM operations and attains high processing element utilization for a wide range of CNNs. Further, our design is sparsity aware, improving performance by effectively mapping sparse feature maps and weights to the processing elements, skipping ineffectual computations and unnecessary data movements involving zeros. Our prototype, SPOTS, is on average 2.16×, 1.74×, and 1.63× faster than Gemmini, Eyeriss, and Sparse-PE, which are prior accelerators for dense and sparse CNNs, respectively. SPOTS is also 78× and 12× more energy-efficient when compared to CPU and GPU implementations, respectively.
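As context for the Im2Col-plus-GEMM formulation the abstract refers to, the following is a minimal NumPy sketch of how a convolution layer becomes a single matrix-matrix product. The shapes, names, and layout (im2col, kh, kw, channel-first tensors) are illustrative assumptions, not the paper's hardware interface.

    import numpy as np

    def im2col(x, kh, kw, stride=1):
        # Unroll a (C, H, W) feature map into a (C*kh*kw, out_h*out_w)
        # matrix whose columns are flattened convolution patches.
        c, h, w = x.shape
        out_h = (h - kh) // stride + 1
        out_w = (w - kw) // stride + 1
        cols = np.empty((c * kh * kw, out_h * out_w))
        col = 0
        for i in range(0, h - kh + 1, stride):
            for j in range(0, w - kw + 1, stride):
                cols[:, col] = x[:, i:i + kh, j:j + kw].ravel()
                col += 1
        return cols, out_h, out_w

    x = np.random.rand(3, 8, 8)        # input feature map (C, H, W), illustrative
    wts = np.random.rand(16, 3, 3, 3)  # 16 filters of shape (C, 3, 3), illustrative
    cols, oh, ow = im2col(x, 3, 3)
    # Convolution as one GEMM: (K, C*kh*kw) times (C*kh*kw, out_h*out_w).
    y = (wts.reshape(16, -1) @ cols).reshape(16, oh, ow)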
Similar resources
Sparse Matrix Multiplication on CAM Based Accelerator
Sparse matrix multiplication is an important component of linear algebra computations. In this paper, an architecture based on Content Addressable Memory (CAM) and Resistive Content Addressable Memory (ReCAM) is proposed for accelerating sparse matrix by sparse vector and matrix multiplication in CSR format. Using functional simulation, we show that the proposed ReCAM-based accelerator exhibits...
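For readers unfamiliar with the CSR format mentioned above, here is a small illustration of the three standard CSR arrays using SciPy; the data/indices/indptr naming is the common library convention, not necessarily the paper's.

    import numpy as np
    from scipy.sparse import csr_matrix

    A = np.array([[5, 0, 0],
                  [0, 0, 3],
                  [4, 0, 9]])
    A_csr = csr_matrix(A)
    print(A_csr.data)     # [5 3 4 9]  nonzero values, row by row
    print(A_csr.indices)  # [0 2 0 2]  column index of each nonzero
    print(A_csr.indptr)   # [0 1 2 4]  row r spans data[indptr[r]:indptr[r+1]]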
Sparse Matrix-Vector Multiplication for the ClearSpeed Accelerator
Sparse matrix-vector multiplication (SpMV), y = A * x, where A is a sparse matrix and x, y are vectors, is a common computational kernel in many application domains that presents challenges for performance optimization. The high ratio of memory operations to computation and the lack of data reuse cause sparse matrix-vector multiplication to be bandwidth intensive. Additionally, the application ...
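As an illustration of why SpMV is bandwidth intensive, the following plain-Python CSR kernel for y = A * x performs one multiply-add per stored nonzero, each streaming a fresh value/index pair from memory with no reuse of A. It is a generic sketch, not the ClearSpeed implementation.

    def spmv_csr(data, indices, indptr, x):
        # y = A @ x for A stored in CSR form.
        y = [0.0] * (len(indptr) - 1)
        for r in range(len(indptr) - 1):
            acc = 0.0
            for k in range(indptr[r], indptr[r + 1]):
                acc += data[k] * x[indices[k]]
            y[r] = acc
        return y

    # Same 3x3 example as above: A @ [1, 1, 1] = [5.0, 3.0, 13.0].
    print(spmv_csr([5, 3, 4, 9], [0, 2, 0, 2], [0, 1, 2, 4], [1, 1, 1]))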
Hyper-Systolic Matrix Multiplication
A novel parallel algorithm for matrix multiplication is presented. The hyper-systolic algorithm makes use of a one-dimensional processor abstraction. The procedure can be implemented on all types of parallel systems. It can handle matrix-vector multiplications as well as transposed matrix products.
Coded Sparse Matrix Multiplication
In a large-scale and distributed matrix multiplication problem C = AB, where C ∈ R^{r×t}, coded computation plays an important role in effectively dealing with "stragglers" (distributed computations that may get delayed due to a few slow or faulty processors). However, existing coded schemes could destroy the significant sparsity that exists in large-scale machine learning problems, and could resul...
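The coded-computation idea can be seen in a toy example: encode two row blocks of A into three worker tasks so that the results from any two workers reconstruct A @ B, tolerating one straggler. This NumPy sketch shows the general principle only, not the paper's sparsity-preserving scheme.

    import numpy as np

    A = np.random.rand(4, 3)
    B = np.random.rand(3, 2)
    A1, A2 = A[:2], A[2:]  # two row blocks of A

    r1 = A1 @ B            # worker 1
    r2 = A2 @ B            # worker 2 (suppose this one straggles)
    r3 = (A1 + A2) @ B     # worker 3 computes the coded block

    # Recover A2 @ B from workers 1 and 3 without waiting for worker 2.
    C = np.vstack([r1, r3 - r1])
    assert np.allclose(C, A @ B)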
FPGA accelerator for floating-point matrix multiplication
This study treats the architecture and implementation of an FPGA accelerator for double-precision floating-point matrix multiplication. The architecture is oriented towards minimising resource utilisation and maximising clock frequency. It employs the block matrix multiplication algorithm which returns the result blocks to the host processor as soon as they are computed. This avoids output buffering...
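A software analogue of the block matrix multiplication pattern described above can be sketched as a generator that yields each result block as soon as it is fully accumulated, mirroring how the design returns blocks to the host without output buffering. The block size b and the square-matrix assumption are illustrative.

    import numpy as np

    def block_matmul(A, B, b):
        # Yield each b x b block of C = A @ B as soon as it is complete,
        # instead of buffering the whole result matrix.
        n = A.shape[0]  # assumes square n x n matrices with b dividing n
        for i in range(0, n, b):
            for j in range(0, n, b):
                Cij = np.zeros((b, b))
                for k in range(0, n, b):
                    Cij += A[i:i + b, k:k + b] @ B[k:k + b, j:j + b]
                yield i, j, Cij

    A, B = np.random.rand(8, 8), np.random.rand(8, 8)
    C = np.zeros((8, 8))
    for i, j, Cij in block_matmul(A, B, 4):
        C[i:i + 4, j:j + 4] = Cij
    assert np.allclose(C, A @ B)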
Journal
Journal title: ACM Transactions on Architecture and Code Optimization
Year: 2022
ISSN: 1544-3973, 1544-3566
DOI: https://doi.org/10.1145/3532863